Precision-Recall Curves Using Information Divergence Frontiers
Despite the tremendous progress in the estimation of generative models, the
development of tools for diagnosing their failures and assessing their
performance has advanced at a much slower pace. Recent developments have
investigated metrics that quantify which parts of the true distribution are
modeled well and, conversely, what the model fails to capture, akin to
precision and recall in information retrieval. In this paper, we present a
general evaluation framework for generative models that measures the trade-off
between precision and recall using R\'enyi divergences. Our framework provides
a novel perspective on existing techniques and extends them to more general
domains. As a key advantage, this formulation encompasses both continuous and
discrete models and allows for the design of efficient algorithms that do not
have to quantize the data. We further analyze the biases of the approximations
used in practice.

Comment: Updated to the AISTATS 2020 version.
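To make the precision-recall trade-off concrete, here is a minimal sketch for two discrete distributions in the style of earlier PRD-type formulations that this framework generalizes (it does not implement the paper's Rényi-divergence frontiers; the min-based definitions below and the `prd_curve` helper are assumptions for illustration):

```python
import numpy as np

def prd_curve(p, q, num_angles=101):
    """Sketch of a discrete precision-recall trade-off curve between a
    reference distribution q and a model distribution p, using the
    min-based formulation from prior PRD-style work (an assumption here,
    not this paper's Renyi formulation):
        precision(lam) = sum_i min(lam * q_i, p_i)
        recall(lam)    = sum_i min(q_i, p_i / lam)
    sweeping the trade-off parameter lam over (0, inf).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # Parameterize lam = tan(angle) so a uniform angle grid covers (0, inf).
    angles = np.linspace(1e-6, np.pi / 2 - 1e-6, num_angles)
    lams = np.tan(angles)
    precision = np.array([np.minimum(lam * q, p).sum() for lam in lams])
    recall = np.array([np.minimum(q, p / lam).sum() for lam in lams])
    return precision, recall
```

For identical distributions the curve reaches precision and recall of 1; for distributions with disjoint support both stay at 0, matching the intuition that such a model neither covers the data nor produces realistic samples.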
Scalable k-Means Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on
a coreset are provably competitive with models trained on the full data set. As
such, they have been successfully used to scale up clustering models to massive
data sets. While existing approaches generally only allow for multiplicative
approximation errors, we propose a novel notion of lightweight coresets that
allows for both multiplicative and additive errors. We provide a single
algorithm to construct lightweight coresets for k-means clustering as well as
soft and hard Bregman clustering. The algorithm is substantially faster than
existing constructions, embarrassingly parallel, and the resulting coresets are
smaller. We further show that the proposed approach naturally generalizes to
statistical k-means clustering and that, compared to existing results, it can
be used to compute smaller summaries for empirical risk minimization. In
extensive experiments, we demonstrate that the proposed algorithm outperforms
existing data summarization strategies in practice.

Comment: To appear in the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD).
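The construction described above can be sketched as follows. This is a hedged illustration of lightweight-coreset-style importance sampling for k-means, assuming the commonly cited form of the sampling distribution (a uniform term plus a term proportional to squared distance from the data mean); the exact constants and the `lightweight_coreset` helper are assumptions, not the paper's verbatim algorithm:

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sketch of lightweight-coreset sampling for k-means.

    Assumed sampling distribution (uniform + distance-to-mean mixture):
        q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum_x' d(x', mean)^2)
    Sampled points are reweighted by 1/(m * q(x)) so that weighted
    sums over the coreset are unbiased estimates of sums over X.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    mu = X.mean(axis=0)
    dists = ((X - mu) ** 2).sum(axis=1)  # squared distances to the mean
    total = dists.sum()
    if total == 0.0:
        # All points coincide: fall back to uniform sampling.
        q = np.full(n, 1.0 / n)
    else:
        q = 0.5 / n + 0.5 * dists / total
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

Only one pass over the data (for the mean and the distances) is needed before sampling, which is why this style of construction is fast and embarrassingly parallel: both the mean and the normalizer are sums that can be computed independently per shard.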